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We investigate the equilibrium state of the model of Peng, et al. for molecular breeding. In the model, 
a population of DNA sequences is successively culled by removing the sequences with the lowest binding 
affinity to a particular target sequence. The remaining sequences are then amplified to restore the original 
population size, undergoing some degree of point-substitution of nucleotides in the process. Working in the 
infinite population size limit, we derive an equation for the equilibrium distribution of binding affinity, here 
modeled by the number of matches to the target sequence. The equation is then solved approximately in the 
limit of large sequence length, in the three regimes of strong, intermediate and weak selection. The approximate 
solutions are verified via comparison to exact numerical results. 
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I. INTRODUCTION 



Recent advances in molecular biology have enabled breeding methodologies to be applied at the molecular level to develop 
novel proteins and DNA sequences with particular desired characteristics! ill- The basic idea is simple: in each generation, a 
number of molecules are selected from the population according to some criteria of desirability; they are then diversified (via 
point mutation and/or recombination) and then amplified back to the original population size. This methodology is worthy of 
study not only for its immediate practical importance but also as a forum for exploring general issues in evolution and population 

■ dynamics in a uniquely controllable setting. 

Recently, motivated in particular by the experiment of Dubertret, et al. Ql on the in vitro evolution of DNA sequences, Peng, 
et al. (PGHL) H 

introduced a simple model of molecular evolution. In the experiment, at each round a specified percentage 
of the population was selected to survive based on its binding affinity to a particular target, the Zaorepressor protein. This was 
accomplished by coating a beaker with the Zac-repressor protein into which the DNA was introduced. The beaker was washed 
out, so that only the most strongly bound sequences remained. The degree of selection could be directly controlled by the 
strength of the washing procedure. Amplification, accompanied by mutation, was subsequently performed by multiple stages of 
PCR. The model of PGHL incorporates these basic processes of selection, mutation and reproduction. Each sequence is modeled 
as a string of L nucleotides, each taken from the set of A — 4 "letters", {A,G,C,T}. The binding affinity of the sequence is 
taken to be proportional to the number of nucleotide matches between it and the target sequence. The fraction (1 — <f) with 
the lowest affinities are dropped from the population. The population is then amplified back to its original level, where each 
daughter sequence in the next round is chosen to be descended from one of the surviving sequences, modified by point mutations 
7— I ■ at a rate of [1q per nucleotide. 

PGHL performed simulations on their model and found that, starting from an initial population with low affinity, the average 
affinity of the population rose, eventually reaching equilibrium. They then constructed a mean-field (i.e. infinite population) 
treatment of the problem. They showed that the average affinity relaxes to its equilibrium value exponentially in time, with a 
time constant of A/iq/(A — 1), and verified this via simulation. However, the calculation of the equilibrium value itself was 
less successful. In their treatment, this value was found to depend on a parameter which was introduced in an ad-hoc manner 
into their theory, and which needed to be fit from simulation. In this paper, we reexamine the equilibrium problem, locating the 

■ source of the difficulty of the PGHL treatment in an inappropriate continuum approximation. In turns out that the problem is 
fairly subtle, and in fact requires different treatment in the case of very weak, very strong and intermediate selection. The range 
of the validity of the intermediate selection increases with L, so that for sufficiently large L, the average affinity is given for any 

' <j)by the intermediate result. 

The plan of the paper is as follows. In Section|n] we lay out the general formalism and basic equations that we need to solve. 
In Sec. [HI] we solve the strong selection problem, working in the small /xrj limit. In Sec. II VI we treat the case of intermediate 
strength selection, also for small /io. In Sec. [V] we generalize the intermediate strength solution to the case of arbitrary /io, 
obtaining the central result of the paper, the average density of mismatches for any <f> and /j,o, given that L is sufficiently large. In 
Section fvTI we tackle the weak selection limit, reverting back to small fi. In Sec. IVHI we return again to the case of intermediate 
selection, calculating the next order corrections. We discuss the connection to the PGHL treatment in Sec. I Villi We conclude in 
Sec. lIXI with some observations. 
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II. BASIC FORMALISM 



We are interested in this paper in calculating the equilibrium distribution of binding affinity in the population of sequences. 
Since the dynamics only depends on the binding affinity, determined by the number of matches between the sequence and the 
target, we can characterize each sequence by this quantity, or equivalently, the number of mismatches, < k < L, in the length 
L sequence. We will work in the infinite population limit, so that fluctuations are irrelevant, and focus on the master equation 
for the normalized mismatch distribution, Pj.. We need to track the changes induced in this distribution by each of the two 
processes, selection and mutation; amplification in and of itself having no effect on the distribution. 

Selection of the fraction <fi of the highest affinity sequences throws out those states with the highest fc's, keeping the lower 
k states. Thus, after selection, some set of states < k < n are retained in their entirety, along with some fraction a of the 
k = n + 1 state. The renormalized distribution after selection, P s , is given by 

< k < n 

K = { jPn+i k = n + l (1) 
k > n + 1 




where the equation 



= aP n+1 + J2 p k (2) 



k=0 

implicitly determines n and a. 

If the mutation rate is sufficiently small, we need only consider single nucleotide mutation events, so that only transitions 
k — ► k ± 1 are allowed. We will take up the case of larger mutation rates in Sec. IV11I The distribution after mutation, P M , is 
given by 

pi; = (i - /* + n + i l ( £ -^ ±1 ) ps-j + a- (^r) p^ (3) 

where /i = /i Q L is the genomic mutation rate, v = 1/(A — 1), nu = 1 — v, and we have set Pf j = 0. Mutation repopulates the 
k = n + 2 state, which was emptied by selection. At equilibrium, the post-mutation distribution P£ must reproduce the initial 
distribution Pk, giving us a closed set of equations for P&. These equations involve only the finite set of states < k < n + 2. 
In fact, P n +2 is not involved in any of the equations, since its initial value is zeroed out by selection. Thus, the equilibrium 
equations close on the n + 2 quantities Pk, < k < n + 1, with P n +2 being slaved to the solution for P n +i- This system is 
(0 < k < n + 1) 



k\ ( L — k + I s 
1 - ii + tw-A f k Pk + M f ^ ) fk-iPk-i 

V v [ — — I Jk+iPk+i 



(4) 



with the filling factor fk = 1 for < k < n, f n +i = ct and /_i = f n +2 = 0. These equations for the P's, with a given n and a 
are linear and homogeneous, with <f> playing the role of an eigenvalue condition for the existence of a nontrivial solution. As a 
increases for a given n, <fi increases, implying weaker selection, till a reaches its limiting value of 1. At this point, we increase 
n and start a again at 0. Thus the solution is piecewise analytic in (f>. This piecewise behavior is seen clearly in Fig. Q, where 
the numerical solution for L — 50, /u = 0.1, A = 2 is presented. 

For simplicity here, we will only treat the case of a = 1~, where the last selected level is in fact completely selected. Writing 
N = n + 2 and setting Pjy = for convenience (its true value being determined afterwards), we can write the equations in their 
final form 

k „ ( L - k + 1\ „ fk + 1 



X Pj; = v-P k + I j Pk-i + v [—) p k+i (0 < k < N - 1). (5) 

with the boundary conditions P_i = Pn = 0, where we have introduced the scaled eigenvalue 

X= ■ (6) 

M 

After the solution of this system, the P's can be normalized via = J2k=o Pk' an d tnen Pn can t> e constructed from Pn-i via 
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We see that the system equilibrium is controlled by the single combined parameter \, rather than the two parameters <f> and 
/i separately. Increasing or /x both act to increase x> an d as we shall see, increase the average mismatch. Since the rate of 
approach to equilibrium is governed solely by /i, one can achieve the same equilibrium in less time by increasing /i and the 
selection strength (thereby decreasing <fi) simultaneously. 

We see that there is a special role to the value <f> = 1 — fi, where x = 0. This is the limit of maximal selection, since at this 
point only the k = is selected. Lower values of </> just result in a partial selection of the k = state, which changes nothing 
since, after amplification, the result is the same. Thus, for <fc < 1 — /j, the solution is Pq = 1 — /x, Pi = fi, and the mean mismatch 
k = [i. The other limit of \ is given by the weak selection limit <\> — > 1~, where \ also approaches 1. 

The system of equations Eq. (0 is not analytically solvable in general, but simplifies in the limit of large L, which we now 
proceed to study. 



III. STRONG SELECTION, N < L 



We first consider the case of strong selection, where the number of selected states, N, is much less than the sequence length 
L. The system Eq. (0 can then be written as 



( X -T 

l-L 
L 



V 



X ~ 



\ 



'±1 
L 



_ (N-2)v 
L 

N-2-L 



(N-l)v 
L 

x- '—r 1 - 



( p ° \ 
Pi 

Pi 



Pn-2 
\ P N -X ) 












(8) 



This homogeneous set of equations has a solution when the determinant, Vfj vanishes, providing an eigenvalue condition on 
X- Expanding in minors about the last column gives a recursion relation 



V N = 



v{N - 1) 
L 



T>N-1 



N-2 
L 



1 V 



N-2 



(9) 



Dropping the 
equation 



P(N-l) 



in the first term and the ^-r~^ m tne second term of the right side of Eq. l|9} (because N L) gives the 



V N+1 = yV N - NV 



N-l 



(10) 



where T>n = (L/u) 2 T>n and y = xy/L/v. The solution of Eq. i ll Oi l is the N' h Hermite polynomial Hejv(y) = 2 2 H N (-M=y 
We are thus interested in the zeroes of Hejv(y). In fact, we have to take the maximum root of Hejv(y), which we call y^, because 
any other solution give rise to negative P„'s. Using Maple, for example, to compute the y^'s, we obtain our approximate solution 
for N < L: 



Vn 



(11) 



We now can compute k, the average number of mismatches. Going back to Eq. (|8), one can see that Pi - y/LPo, P 2 - VLPi 
etc., so that Pn-i is the single dominant term and is thus 1 to leading order (P/v is of order /i, and so is also irrelevant in our 
small [i limit). This gives us k « N — 1, so that 



1 



(12) 



This is presented in Fig. 1, along with the exact numerical results for the case L = 50, for the case A = 2 (v = 1). As we 
noted earlier, the numerical results are piecewise continuous, due to the partially filled levels, with the transitions between levels 
apparent as cusps in the curve. We see that the agreement is quite good for the smaller N, and gets worse as N increases, as 
expected. 

It is interesting to investigate the behavior of y* N for large N. We remember that the iV'th Hermite polynomial, times a 
Gaussian, is the solution of the Schroedinger equation for the harmonic oscillator, with u = 1/2 and energy level En = 

(JV+|)w, (h = m= 1): 



1 



1, 



(13) 
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The zeroes of He n are the zeros of the wave function ip. For large N, the maximal zero is near the classical turning point, where 
the energy equals the potential, so that x 2 /8 « N/2, or x — 2y/~N. To achieve a more accurate approximation, we expand the 
potential around the turning point, writing x — 2\fN + a(z — Zq), and dropping the quadratic term gives an Airy equation 

J = (14) 
if we choose a = N~ x / & , z Q = — N~*/ 3 /2, The first zero of the Airy function is at z = —2.338, so that 

y* N = 2ViV - 2.338AT 1/6 + -^N' 1 ' 2 (15) 

. The next order correction due to the dropped quadratic piece can be seen to be of order N~ 2 / 3 . This is an excellent approx- 
imation to y* N even for N = 2, as can be seen in Fig. ^ where this approximation is used to achieve a completely analytic 
expression for 4>(k). 

The upshot of all this is that in the first region the scaling variables are y = 'L/u{4> ~ (1 — A*)) /A* an d Plotting the results 
for different L in these variables demonstrates the collapse beautifully, see Fig. |2] 

The only question left to answer is how far the first region extends. We see from Fig. 1 that for L = 50 it covers roughly 70% 
of the range of above the threshold for single level selection, (/) = 1 — //, which is surprisingly high. The explanation of this 
depends on the results of the next section, with the intermediate range asymptotics. We anticipate the results by noting that the 
leading order correction to N(<f>) is, for A — 2, SN ~ N 2 /L. Thus, the first-region results start failing for 5N/N w 0.05, or 
N/L w 0.05. For large L, x ~ 2y/N/L, we expect that for large L the first-region results are trustworthy out to \ ~ 2%/. 05 ~ 
0.45, i.e.; 45% of the region of interest. For smaller L, if fact things are better as we have seen. It can be shown that a more 
accurate criterion is (N - 2.3387V 1 / 3 ) 2 /(iV£) » 0.05. Applying this to L = 50 gives N w 10 which compares quite well with 
the numerical results in Fig. 1. For v ^ 1, the first region is much narrower, though still a finite fraction of the allowed range of 
X, independent of L. The first region approximation here fails when x ~ 0.1i//P, which for ab = 4, (u = 1/3), reads x ~ -05, 
very much smaller. We can see this in Fig. [5J where again our approximation is over-conservative for smaller L. 



IV. INTERMEDIATE SELECTION, N > 1, L/(l + v ) - N > 1 

When N gets too large, of order L, we can no longer neglect the V ^ N ~^> and ^-j^- terms in the recursion relation for the 
determinants. It is more convenient now to consider the P n 's directly, using the difference equation|5] Motivated by the result 
that, in the first region, the iVs grow rapidly with k, we assume that here too the same rule applies: The relevant P's are those 
with k ~ N. Writing = aL, we thus assume that k ~ aL. Plugging this into Eq. (0, we get the difference equation 



xPk = voiP k + uaP k+ i + (1 - a)P k -i 



(16) 
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FIG. 2: Results for fj, = 0.1, A = 2, and L = 50, 100, 200, 1000, and 5000, plotted using the scaling variables y and q, together with the 
first-region approximation, Eg, 1 121 and the further approximation to this using the Hermite asymptotics, Eg. 1151 
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FIG. 3: Comparison between exact results for L — 50, fi = 0.1, A = 4 and the first-region approximation, Eg. 1121 



with the solution 



where 



P n = AM + SA™ 



A+ = 



X-va 



± y/(x~ vet) 2 — Ava(\ — a) 



2va 



(17) 



(18) 



There are 3 options for A + and A_: 1. They can both be real, but this is incompatible with P-\ = Pn = 0. 2. They can both 
be complex, but then we get an oscillating solution, which is also not acceptable, since the P's must be non-negative. 3. The 
degenerate solution, where A+ = A_ = A, and the solution that satisfies Pn = is 



A\ n (N- n) 



(19) 



This solution indeed decays rapidly on a scale of order 1 as n decreases from N, justifying our initial assumption. It also means 
that the change in the solution from one k to the next is also of order 1, and thus the difference equation can not be approximated 
by a differential equation, as PGHL attempted. 
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FIG. 4: Comparison between the numerical results for the mismatch density, q vs. <j> and Eq. 1231 for L = 50, 100, 1000 with fi — 0.1, A = 2. 



The condition for degeneracy of the A's is 

X = va. + 2-v/i/a(l - a) (20) 

hence 



_iV N N 

4> = MX + 1 - M = 1 - M+ + 2^1/ _ jF") (21) 

The last task is to calculate k. Since the P n 's decay exponentially (with an L-independent width) as n decreases, to leading 
order in 1/L, 

k = N . (22) 

Since here TV = aL, there is a finite probability, q = k/L of a mismatch per nucleotide in the intermediate selection region. 
Combining results, we have 



k 1 



1 = T = 



2v + \v- 1\/v{v + vx~ X 2 ) 



(23) 



A comparison of the formula in Eq. d23i and numerical results is given in Fig. @ for A = 2, and in Fig. for A = 4 
(i/ = 1/3). We see that our formula overlaps that of the first region for a <C 1, so that x ~ = 2WuN / L, which is 

indeed the first region result for large N. The leading correction in the overlap can be calculated, leading to the results discussed 
above in Sec. [HI] The second region solution has q increasing from at X = ((/> = 1 — fx) to the fully random value of 
1/(1 + v) = {A — V) / A &t x = 1 (<t> = 1)) so that this second region result covers the entire range of q and cj>. Indeed, for fixed 
(j), this result is asymptotically correct for sufficiently large L. At small x, our result only breaks down when N is not large, 
which only happens for x ~ 0(L~ x l 2 ). Similarly, it breaks down a is too close to 1/(1 + v), since then A goes to unity and the 
relevant fe's are no longer concentrated in a region around k — N. We shall see in Sec. I VII that this occurs for 1/(1 + v) — a of 
order L -1 / 2 , so that A is of order \J~L. Thus, except for small regions of <f> near the very weak and very strong limits, the size of 
which vanish in the large L limit, our result Eq. i23\ is reliable. 



V. INTERMEDIATE SELECTION, GENERAL fi 



Up to now we have only considered the case of small mutation rate, so that only single transitions had to be considered. We 
now take up the case of large /i, dealing with multiple transitions in a single round. Again, we are working in the intermediate 
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FIG. 5: Comparison between the numerical results for the mismatch density, q vs. <f> and Eq. 1231 for L = 50, 100, 1000 with fi — 0.1, A = 4. 



selection region where only the states with k « AT = aL are relevant. We take the probability of a single site mutation /io to 
be small, but [i = hqL to not necessarily be small, so that we may use the Poisson distribution for the number of up and down 
mutations, considered separately. The mean number of up (increasing mismatch) mutations is fi(l — a) and the mean number of 
down mutations is fiva. Consider the case, for example, where after the mutation phase you wind up at the same level where you 
began. This could result, of course, from having no mutations of any kind, which has a probability e - ^ 1 "") • e^ va = e~^+' icw . 
In addition, however, one can wind up back where one started by having two mutations, one up and one down. The probability 
for this is (e~^+^ QI/ ) • /^(l — a) ■ \iva. One can also return to the original state after 4, 6, . . . mutations. The total probability of 
remaining in the original state is then 



2k 



to ( fc! ) 2 



(ai/(l - a)) k = er^+^Io {2^av{\ - a)) 



where I is the modified Bessel function. Henceforth, we shall write 2[i^J ctv(\ — a) = t. Proceeding likewise, the probability 
of mutating down from state n + j (j > 0) to state n is 



\a\iv\- < 



3 „-p+a 



k=0 



{au{l - ajf n 2k 
k\(k+j)\ 



CtV 

1 - a 



i/2 



I At) 



and up from n — j to n is 



[(1 - «K e 



1 e -P+HOtV 



(av{l - a)) k fj, 2k 
to + 



l-a 



-3/2 



Putting this all together yields the recursion relation (J_j = Ij) 
The characteristic polynomial (here of infinite order) is 

oo 

= -<f> + e -n+nMi-v) ^ A J 



6 ^ n+J \l-a 

j=—oo 



J/2 



(24) 



l-a 



j72 



(25) 
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This sum can be performed (see, e.g. 0|, Eq. 9.6.33), yielding 

F(X; 0) = -<f) + e -M+^(i-^) e (^ra+ w) 5 = o ( 26) 



The condition for degeneracy is ^ = 0, yielding A = y 7^7, exactly as in the small fi case! Plugging this into Eq. J26l > gives 

(f) = e -M+M«5+t (2?) 
_ 6 m(2\/ qi/(1-q)-i) +o^P ^g) 



In the limit of small /i, w 1 + /1 (2^/ai/(l — a) + a(y — lj which is of course what we got before. As with small fj,, the 
mismatch density q is just a to leading order, so that 

= e n(2yfqv(l-q)~l+q(l- V )} ^ ^„ 

Turning this around gives 



(1+^ 



(30) 



A comparison between Eq. (I30i and numerical results for ^ = | and // = 1 is shown in Fig. (|6j. The formula for general /1 
can be seen to reproduce exactly that for small /1 with the replacement (1 — <t>) by — ln(0), with the control parameter in general 
being — ln(</>) ///, so that stronger mutation can always be exactly compensated by stronger selection. 



VI. WEAK SELECTION 



A. Determining x(A) 



We return to Equation describing the equilibrium state. We saw from the second region solution that what N approaches 
L/(l + v), the width of the distribution diverges and one can no longer consider k as a constant. However, the fact that the 
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distribution is very broad (of order \f~L as we will see) means that it changes slowly with k, which allows us to expand Pk+i and 
Pk-i in a Taylor series, yielding: 



L J V 'L L 



1 ( -4 + 1 + i±il N ) p; ^ o. (3i) 



2 V L L 

where x is a continuous variable replacing the discrete variable k in Eq. l|5}- We can eliminate L by going over to the scaled 
variables 

x--±- 

y= ; 6 X = L(l-x), (32) 

yielding to leading order 

-^—P^ + ((1 + v)y) P' y + {5 X + l + v) P y = 0. (33) 
1 + v a " 

We again transform into a Schroedinger equation, via then similarity transformation P y = f y g y with 

(1 + W 2 y~ 

9y = e ^ (34) 



yielding 



v *n , f( l + ») 2 1 + " 



l + ^'H^^ — -«XJA = 0. (35) 
Rescaling z = (1 + v)yj 1 we obtain (yet again) the Schroedinger equation for a harmonic oscillator, 

~ l -r;+ l -z 2 f z = Ef z (36) 

in which rn — H = 1, to = 1/2, with the energy term reading 



E = -+ , 6x r (37) 
4 2(1 + j/) 



The boundary conditions, P_i = Pn = 0, are equivalent to infinite walls at z = —\jLjv and z = ((1 + ^)./V — L)/\fvL. 
Because the factor g z decays rapidly as z decreases away from 0, we can replace the left boundary condition by P(z = — oo) = 0. 
at the cost of an exponentially small error. Thus, for a given x> or equivalently E, as we come in from y = — oo we want to find 
the first z — z* for which /* = 0, which then determines N through 

N=J^(l + z\[^\. (38) 



Thus, as we noted in the beginning, N is a distance of order \f~L from L/(l + v) in the third region. There are of course a set 
of E\ for which we know z* analytically. If E = (n + h)u> for some whole number n, then the solution of the Schroedinger 
equation is a Gaussian times the Hermite polynomial, He n , and so z* in this case is just the largest (negative) root of the He„. 
It is remarkable that, as in the first region, the solution is determined by the zeroes of the Hermite polynomial. However, 
whereas there the quantum level n increased with <f>, here it is just the opposite. Now, the root z* decreases with n, so that as \ 
decreases from 1, TV decreases, as we expect. At n = 1, the wave function has a single root at the origin, so z* vanishes, and so 
N = L/(l + v) at this point. For the case of general E, the solution for / is a parabolic cylinder function, and the root z* has 
to be determined numerically. In particular, for E < |w, z* is positive, so N > L/(l + v). As the ground state wave function 
has no zeros, z* must go off to oo as £ approaches the ground state energy of E — |o;. In fact, the case E = ^uj is precisely 
the limiting case x = 1 where there is no selection and so N = L. The divergence we obtain is consistent with with the fact 
that the exact z* is not order 1, but rather of order \[L at this point, and is a result of our approximation of the left boundary 
condition. Thus, the scaling variables in the third region are E and z* , which are related to the physical variables \ an d N y i a 
Eqs. J37l > and fl38b . The results for the numerical calculation of z* vs. E is presented in Fig. together with the exact solution 
for integer n in terms of the zeros of the Hermite polynomial and the asymptotic formula for large E derived in the section on 
the first region: z* = -(2^- 2.33871-5 + * ) « -(2V2P - 2.083P- 1 / 6 ). 
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FIG. 7: Comparison between the numerical solution of z* = (2N — L) /yL vs. E = ((1 
the exact numerical solution of Eq. {4} for L — 50, 500, with jj, — 0.1, A = 2. 



)L + l)/4 together with the scaled results of 



B. Finding x(N) 

Using the relation between the variables x and z above, we will calculate the average x via 



L VvL 

z (39) 



1 + u 1 
where 

z= j -™ ± ■ (40) 

J-oo f* e 4 dz 

where z* is as above the value of z for which f z = 0. Using the numerical solution for / for various E's allows us to construct 
the scaling function z(E), the scaled fe, (i.e. < - 1+ ^~ L ) vs. the scaled selection parameter, E = 1/4 + (1 — x)L/2(l + v) as 
shown in Fig (|8). 

We can generate an approximation to z for large E. Here as we saw, z* is large and negative and so the Gaussian factor in the 
integral restricts the relevant range of z to those near z* , which as we saw in the Region 1 calculation, is itself near the classical 
turning point zq = —2y2E. Proceeding as in Sec. [22 we introduce z = z$ + Az, A = (— zq/2)~3 , and requiring /(— oo) = 
yields 

f(z) = Ai(-z). (41) 
We do not worry about normalization since we explicitly divide by the normalization integral. This leads to 

/-LOo + Ai)K\(-~z)e- z « Ai d~z 



J^ oo Ai(~S)e- z " Ai dz 

where z* = 2.338 is the first zero of /. Changing the integration variable z to s = z* — z yields: 



(42) 



f n °° e ZoAs / 2 sAUs- z*)ds 
z = -z -A~z*- Jo ; -L-. (43) 
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FIG. 8: Comparison between z, the scaled average mismatch vs. E, the scaled selection parameter from the numerical calculation of Eq. 1401 
and exact numerical results for L = 50, 500, 1000, and 5000, with fj, = 0.1, A = 2. 



Because of the fast decay of the exponential factor, (\zq\ ^> 1), the main contribution for the integral in j43t is around s = 
This allows us to expand Ai(s — z*) in J43l > in a Taylor series: Ai(s — z*) w sAi'(— 5*) + . . .. This gives: 

r~ s 2 eZ aA S /2 ds 4 

Z = Z + ^Z ^ = Z + Az - — - 

= -2V2E + 2.3338 (2E)~ 1/6 - (E/4y 1/3 . (44) 
On the scale of Fig. (|8j, the difference between this formula and the exact numerical calculation is not visible. 



VII. NEXT-ORDER CORRECTION TO INTERMEDIATE SELECTION 



In order to better understand the origin of the degeneracy condition we invoked to solve the second region problem, it is 
worthwhile to explore the next-order correction. In addition, we will be able to understand the relatively poor accuracy of our 
result for q(<p). We will for simplicity restrict ourselves in this section to the case A = 2. In the second region, we could 
not replace the difference equation by a differential equation, since the change in P from step to step was of order 1 . We will 
fix this by rescaling P by the geometrical series behavior we found, by putting Q n = where A is as determined above, 

A = 1/(1 — a)/a, so that the differential equation: 



XPn - (1 - ^-^-)Pn-l + ^-Pn+l- 



(45) 



becomes 



xQnX n = Q„-iA"- 1 (l 



n — 1 , 
L ' 



n.+ l 



n + 1 



(46) 



We now assume the Q's vary smoothly and expand Q n -\ and Q„+i in a Taylor series, (writing n = aL + y), and dropping all 
terms of order L~ l , though not y/L) yielding 



L 



91 
2 



Q' 



Xo-X + 



fly 







(47) 
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8% vs. a 

Second Region. jx=0. 1 




FIG. 9: Comparison between the exact numerical results for 8\ vs a and the formula in Eq. J52> 



where xo = Aa + (1 — a)/A = 2^/a(l — a), 7 = A + 1/A = 
into a Schroedinger equation by a similarity transformation, Q 



2/xo an( J /? = A — 1/A = (1 — 2a)7- As usual, we transform 
= fg, with: 



xo + t f\ 



giving 



- 2 f" + f 



0v 

Xo- f 



2 2 

71 



Xo +/9y/i 2V{ Xo + (3y/Lf 



(48) 



(49) 



where we have again dropped a term of order L . We see that the coefficient of / in Eq. ( I49> vanishes at y = if % is taken 
equal to xo, the zeroth order solution. Thus the zeroth order solution corresponds to choosing the "energy" — x so that the 
classical turning point is at y = 0, or n = N. However, the true boundary condition is that / vanish at n = N, which happens 
some small distance (relative to the length scale for changes of the potential L) behind the turning point. Thus, we must choose 
X slightly less than xo m order to move the turning point to negative y. Assuming y/L and 5\ = X ~ Xo are small, we get an 
Airy equation: 



1 



/" + / 



5x _ _Py_ 

Xo XoL 



The solution of this equation that decays for negative y is 



'-4(3) 



= 



L5x 



(50) 



(51) 



The requirement that the first zero of / lie at y = implies 

_ -2.338/3 2 / 3 (xo/2) 1 / 3 _ -2.338(1 - 2a) 2 / 3 
X " L 2 / 3 ~ (a(l-a)) 1 /6 J L2/3 

A comparison of the formula in J52b and the exact numerical results is given in Fig. (|9Jl. 



(52) 
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Remembering that x + &X — 1+f * yields: 



l-»+j-y/N{L-N) 



2-2.338 



TV L-N 



(53) 



where N — ah. Now, let's remember we got for the second region at zeroth order P n oc (N — n)X n . We can calculate n as 
follows: 



£n=0 nP n 
n=0 r n 



(54) 



with N = aL, and we get, up to exponentially small terms 



n = N - 



L + 2y/N(L- N) 



L-2N 



(55) 



so we get a correction for n — N. Combining this correction with the correction for in J53l >. We get a more accurate graph of 
n as a function of 6. 



VIII. COMPARISON TO PGHL 



Before concluding, we wish to make some comments on the connection of this analysis to that attempted by PGHL. As we 
have noted, their approach was doomed by the assumption of a continuum limit for the distribution function. This is true, as we 
have seen, only in the weak selection region, which is valid only in a very limit range of 4> near 1, of order fi/L, Even there, then- 
answer as stated is incorrect, due to their neglect of the dependence of the effective diffusion constant on the local mismatch. 
If one uses for the diffusion constant that obtained at the average mismatch, one obtains an implicit equation for the average 
mismatch at equilibrium, which agrees with our results for the overlap region between the second and third regions. Neglecting 
this dependence forced PGHL to introduce a parameter r which they fitted numerically to in essence restore the correct diffusion 
constant. Achieving agreement with the asymptotic behavior for weak selection would have dictated a value of r = 2 in the 
infinite L limit. Fitting r at the finite value L = 170 gave them instead a value of r = 2.77. Actually, their formula neglecting 
the mismatch dependence of the diffusion constant and using r actually results in better agreement with the exact results, this 
agreement is the fortuitous outcome of a partial cancellation of the errors induced both by the mismatch dependence of the 
diffusion constant and the continuum approximation. At strong selection, their approach in any case is unreliable. This situation 
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FIG. 11: Comparison between the exact numerical results for x vs <j> and the formulas in Eqs. I53t and I55i for L = 100, fi = 0.1 A = 2. 
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FIG. 12: Comparison of the numerical results for the steady-state for the parameters investigated by PGHL, L — 170, fi = 1.7, A — 4 and 
our second-region theory, Eq. 1301 with the results of the PGHL theory, with r = 2.77 and the corrected version of the PGHL theory, with the 
mismatch dependent diffusion constant and r = 1. 



is summarized in Fig. (I12L where we show the results of our theory together with the stated predictions of PGHL, along with 
the (partially) corrected form of their theory taking account of the mismatch dependence of the the diffusion constant, case 
they considered. We see that the partially corrected form of PGHL agrees near 0=1 and subsequently systematically deviates 
from the simulations. The r = 2.77 original formula numerically does better overall, but is nowhere correct. Nevertheless, 
the basic mechanism responsible for selecting the equilibrium state, namely the degeneracy condition discussed in Sec. IIVI 
was correctly identified by PGHL. This degeneracy condition is intimately connected with the marginally stable nature of the 
dynamic problem| 5 ] . 
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FIG. 13: Average mismatch vs. selection parameter <f> for L — 200, /i = 0.1, A — 2, together with the three different approximations 
appropriate for weak, intermediate and strong selection. 



We have seen that the steady-state equation is quite rich in its behavior, with different behavior in each of the three separate 
regimes of the selection parameter <fi. As a summary, we present in Fig. [0]a graph with all three approximations displayed, 
as well as the exact numerical solution for N = 200, fi — 0.1, A = 2. We see that, as already noted, the strong selection 
approximation covers roughly the first half of the nontrivial range of </>, while the second-order intermediate selection result 
works essentially from the smallest <fr = 1 — /i up to of 0.99. The weak selection theory works in this case already from 
</> = 0.97. 

A few qualitative features stand out from the solution and are worth pointing out. First, the average mismatch is generically 
a finite fraction of the sequence length, except for very strong selection, extremely near the threshold for selecting only the best 
state. Second, the distribution of mismatches is very skewed toward high mismatch. The most probable state is very near the 
bottom of the barrel of the states that survive at all. The high affinity states are exponentially rare in the equilibrium distribution. 
Signs are this behavior are perhaps discernible in the experimental data of Ref. Q|. Last is the exact tradeoff between selection 
strength and mutation rate, due to the existence of a single lumped control parameter. This opens the possibility for optimizing 
the breeding process by increasing the mutation rate to speed the approach to equilibrium, at no cost in the quality of the final 
distribution of affinities. There is of course a limit to this imposed by the finite population size, which we have not considered at 
all in this work. If the selection criteria is too rigid, the finite population size will eventually begin to play a role. 
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